This document uses the Sigle/Robinson book on Tidy Text Mining. Chapter 7 has a very useful case study for comparing twitter archives.

load libraries

library(rtweet)
library(tidyverse)
library(tidytext)
library(wordcloud2)

Get Tweets

The other notebook, 01_gather-tweets.Rmd, covers tweet gathering in more detail

search_tweets

tweet_collection <- search_tweets("lego star wars", n=10000, lang = "en")

Downloading [>----------------------------------------]   2%
Downloading [>----------------------------------------]   3%
Downloading [=>---------------------------------------]   4%
Downloading [=>---------------------------------------]   5%
Downloading [=>---------------------------------------]   6%
Downloading [==>--------------------------------------]   7%
Downloading [==>--------------------------------------]   8%
Downloading [===>-------------------------------------]   9%
Downloading [===>-------------------------------------]  10%
Downloading [====>------------------------------------]  11%
Downloading [====>------------------------------------]  12%
Downloading [====>------------------------------------]  13%
Downloading [=====>-----------------------------------]  14%
Downloading [=====>-----------------------------------]  15%
Downloading [======>----------------------------------]  16%
Downloading [======>----------------------------------]  17%
Downloading [======>----------------------------------]  18%
Downloading [=======>---------------------------------]  19%
Downloading [=======>---------------------------------]  20%
Downloading [========>--------------------------------]  21%
Downloading [========>--------------------------------]  22%
Downloading [========>--------------------------------]  23%
Downloading [=========>-------------------------------]  24%
Downloading [=========>-------------------------------]  25%
Downloading [==========>------------------------------]  26%
Downloading [==========>------------------------------]  27%
Downloading [==========>------------------------------]  28%
Downloading [===========>-----------------------------]  29%
Downloading [===========>-----------------------------]  30%
Downloading [============>----------------------------]  31%
Downloading [============>----------------------------]  32%
Downloading [=============>---------------------------]  33%
Downloading [=============>---------------------------]  34%
Downloading [=============>---------------------------]  35%
Downloading [==============>--------------------------]  36%
Downloading [==============>--------------------------]  37%
Downloading [===============>-------------------------]  38%
Downloading [===============>-------------------------]  39%
Downloading [===============>-------------------------]  40%
Downloading [================>------------------------]  41%
Downloading [================>------------------------]  42%
Downloading [=================>-----------------------]  43%
Downloading [=================>-----------------------]  44%
Downloading [=================>-----------------------]  45%
Downloading [==================>----------------------]  46%
Downloading [==================>----------------------]  47%
Downloading [===================>---------------------]  48%
Downloading [===================>---------------------]  49%
Downloading [===================>---------------------]  50%
Downloading [====================>--------------------]  51%
Downloading [====================>--------------------]  52%
Downloading [=====================>-------------------]  53%
Downloading [=====================>-------------------]  54%
Downloading [======================>------------------]  55%
Downloading [======================>------------------]  56%
Downloading [======================>------------------]  57%
Downloading [=======================>-----------------]  58%
Downloading [=======================>-----------------]  59%
Downloading [========================>----------------]  60%
Downloading [========================>----------------]  61%
Downloading [========================>----------------]  62%
Downloading [=========================>---------------]  63%
Downloading [=========================>---------------]  64%
Downloading [==========================>--------------]  65%
Downloading [==========================>--------------]  66%
Downloading [==========================>--------------]  67%
Downloading [===========================>-------------]  68%
Downloading [===========================>-------------]  69%
Downloading [============================>------------]  70%
Downloading [============================>------------]  71%
Downloading [=============================>-----------]  72%
Downloading [=============================>-----------]  73%
Downloading [=============================>-----------]  74%
Downloading [==============================>----------]  75%
Downloading [==============================>----------]  76%
Downloading [===============================>---------]  77%
Downloading [===============================>---------]  78%
Downloading [===============================>---------]  79%
Downloading [================================>--------]  80%
Downloading [================================>--------]  81%
Downloading [=================================>-------]  82%
Downloading [=================================>-------]  83%
Downloading [=================================>-------]  84%
Downloading [==================================>------]  85%
Downloading [==================================>------]  86%
Downloading [===================================>-----]  87%
Downloading [===================================>-----]  88%
Downloading [===================================>-----]  89%
Downloading [====================================>----]  90%
Downloading [====================================>----]  91%
Downloading [=====================================>---]  92%
Downloading [=====================================>---]  93%
tweet_collection.orig <- tweet_collection
tweet_collection <- tweet_collection %>% 
  filter(is_retweet == "FALSE") 

tweet_collection

Tokenize tweets

tweets_by_tweeter <- tweet_collection %>% 
  group_by(screen_name) %>% 
  mutate(line = row_number()) %>% 
  ungroup()

tweets_by_tweeter %>% 
  count(screen_name, sort = TRUE)

# glimpse(tweets_by_tweeter)

"Because we have kept text such as hashtags and usernames in the dataset, we can’t use a simple anti_join() to remove stop words. Instead, we can take the approach shown in the filter() line that uses str_detect() from the stringr package. – https://www.tidytextmining.com/twitter.html

tweets_tokenized <- tweets_by_tweeter %>% 
  select(text, screen_name, line) %>% 
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]")) 
Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
tweets_tokenized

stop words

The rtweet package has a special stopwords dataset that may come in handy with tweets….

head(stopwordslangs)
tweets_tokenized %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  filter(!str_detect(word, "^\\@")) %>% 
  anti_join(stopwordslangs)           # anti_join(tidytext::get_stopwords())
Joining, by = "word"

Word frequencies

Calculate word frequency

frequency <- tweets_tokenized %>% 
  group_by(screen_name) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tweets_tokenized %>% 
              group_by(screen_name) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
`summarise()` ungrouping output (override with `.groups` argument)
Joining, by = "screen_name"
frequency

"This is a nice and tidy data frame but we would actually like to plot those frequencies on the x- and y-axes of a plot, so we will need to use spread() from tidyr make a differently shaped data frame. – https://www.tidytextmining.com/twitter.html

pivot_wider

frequency <- frequency %>% 
  select(screen_name, word, freq) %>% 
  pivot_wider(names_from = screen_name, values_from = freq) #, values_fill = 0)

frequency 

Visualize

Word cloud

tweets_tokenized %>% 
  # group_by(screen_name) %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  head(500) %>%
  filter(!str_detect(word, "^\\@")) %>% 
  anti_join(stopwordslangs)  %>% 
  wordcloud2(size = 2)
Joining, by = "word"

Frequencies are often better expressed with bars

tweets_tokenized %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  filter(!str_detect(word, "^\\@")) %>% 
  slice_head(n = 30) %>% 
  ggplot(aes(freq, fct_reorder(word, freq))) +
  geom_col()


tweets_tokenized %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  anti_join(stopwordslangs) %>%                    # twitter stop words.
  filter(!str_detect(word, "^\\@")) %>% 
  slice_head(n = 30) %>% 
  ggplot(aes(freq, fct_reorder(word, freq))) +
  geom_col()
Joining, by = "word"

Or plot the word frequency ratios of two tweeters https://www.tidytextmining.com/twitter.html#word-frequencies-1

ggplot(frequency, aes(RGONZALEZ1978, BrickFanatics)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  geom_abline(color = "firebrick")


# fs::dir_create("images")
# ggsave("images/dukeball.png")

# "CBBCent1" | screen_name == "Adam_Bradford1
# marchmadness  TheAndyKatz

# RGONZALEZ1978
# <dbl>
# BrickFanatics
# <dbl>


ggplot(frequency, aes(BrickFanatics, fantasysite)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  geom_abline(color = "firebrick")

Word Usage

https://www.tidytextmining.com/twitter.html#comparing-word-usage

tweets_by_tweeter %>% 
  summarise(min_date = min(created_at), max_date = max(created_at))
word_ratios <- tweets_tokenized %>%
  # filter(screen_name == "CBBCent1" | screen_name == "Adam_Bradford14") %>%
  filter(screen_name == "BrickFanatics" | screen_name == "fantasysite") %>% 
  filter(!str_detect(word, "^@")) %>%
  count(word, screen_name) %>%
  group_by(word) %>%
  filter(sum(n) >= 2) %>%
  ungroup() %>%
  pivot_wider(names_from = screen_name, values_from = n, values_fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(BrickFanatics / fantasysite)) %>%
  arrange(desc(logratio))

word_ratios

equal usage

word_ratios %>% 
  arrange(abs(logratio))
word_ratios %>%
  group_by(logratio < 0) %>%  # < 0) %>%
  top_n(15, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col() + #show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (BrickFanatics/fantasysite)") +
  # scale_fill_discrete(name = "", labels = c("BrickFanatics", "fantasysite")) +
  scale_fill_brewer(palette = "Dark2", name = "", labels = c("BrickFanatics", "fantasysite"))

Favorites and retweets

https://www.tidytextmining.com/twitter.html#favorites-and-retweets

Changes in word use

https://www.tidytextmining.com/twitter.html#changes-in-word-use

Term Document Matrix

https://www.tidytextmining.com/tfidf.html#the-bind_tf_idf-function

tweet_words <- tweets_by_tweeter %>% 
  select(screen_name, text, status_id, user_id) %>%  
  unnest_tokens(word, text, token = "tweets") %>% 
  filter(!str_detect(word, "^\\@")) %>%
  filter(!str_detect(word, "^http")) %>%
  anti_join(stopwordslangs)  %>% 
  count(word, tweeter = screen_name, sort = TRUE)
Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
Joining, by = "word"
tweet_words

total_words <- tweet_words %>%
  group_by(tweeter) %>%
  summarize(total = sum(n)) %>% 
  arrange(-total)
`summarise()` ungrouping output (override with `.groups` argument)
total_words

tweet_words <- left_join(tweet_words, total_words)
Joining, by = "tweeter"
tweet_words
tweet_words %>% 
  bind_tf_idf(word, tweeter, n)
tweet_words %>% 
  bind_tf_idf(word, tweeter, n) %>% 
  arrange(desc(tf_idf)) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  filter(n > 10) %>% 
  # group_by(tweeter) %>% 
  # top_n(2) %>% 
  # ungroup() %>% 
  ggplot(aes(word, tf_idf)) +
  geom_col() +
  facet_wrap(~ tweeter) +
  coord_flip()

Resource list

---
title: "Rtweet"
subtitle: "an Rfun demonstration"
author: "John Little"
date: '`r Sys.Date()`'
output: html_notebook
---

This document uses the Sigle/Robinson book on [_Tidy Text Mining_](https://www.tidytextmining.com/). Chapter 7 has a very useful case study for comparing twitter archives.
 

## load libraries

```{r libraries, message=FALSE, warning=FALSE}
library(rtweet)
library(tidyverse)
library(tidytext)
library(wordcloud2)
```

## Get Tweets 

The other notebook, 01_gather-tweets.Rmd, covers tweet gathering in more detail

search_tweets
```{r getTweets}
tweet_collection <- search_tweets("lego star wars", n=10000, lang = "en")

tweet_collection.orig <- tweet_collection
```


```{r process Tweets}
tweet_collection <- tweet_collection %>% 
  filter(is_retweet == "FALSE") 

tweet_collection
```


## Tokenize tweets

```{r corpus2vector}
tweets_by_tweeter <- tweet_collection %>% 
  group_by(screen_name) %>% 
  mutate(line = row_number()) %>% 
  ungroup()

tweets_by_tweeter %>% 
  count(screen_name, sort = TRUE)

# glimpse(tweets_by_tweeter)
```


> "Because we have kept text such as hashtags and usernames in the dataset, we can’t use a simple anti_join() to remove stop words. Instead, we can take the approach shown in the filter() line that uses str_detect() from the stringr package. -- https://www.tidytextmining.com/twitter.html


```{r tokenized tweets}
tweets_tokenized <- tweets_by_tweeter %>% 
  select(text, screen_name, line) %>% 
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]")) 

tweets_tokenized
```

## stop words

The `rtweet` package has a special stopwords dataset that may come in handy with tweets....

```{r}
head(stopwordslangs)
```


```{r}
tweets_tokenized %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  filter(!str_detect(word, "^\\@")) %>% 
  anti_join(stopwordslangs)           # anti_join(tidytext::get_stopwords())
```


## Word frequencies

### Calculate word frequency

```{r}
frequency <- tweets_tokenized %>% 
  group_by(screen_name) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tweets_tokenized %>% 
              group_by(screen_name) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)

frequency
```

> "This is a nice and tidy data frame but we would actually like to plot those frequencies on the x- and y-axes of a plot, so we will need to use spread() from tidyr make a differently shaped data frame. -- https://www.tidytextmining.com/twitter.html

pivot_wider

```{r}
frequency <- frequency %>% 
  select(screen_name, word, freq) %>% 
  pivot_wider(names_from = screen_name, values_from = freq) #, values_fill = 0)

frequency 
```

### Visualize

Word cloud

```{r word-cloud}
tweets_tokenized %>% 
  # group_by(screen_name) %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  head(500) %>%
  filter(!str_detect(word, "^\\@")) %>% 
  anti_join(stopwordslangs)  %>% 
  wordcloud2(size = 2)

```

Frequencies are often better expressed with bars

```{r word-freq barplot, fig.height=7}
tweets_tokenized %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  filter(!str_detect(word, "^\\@")) %>% 
  slice_head(n = 30) %>% 
  ggplot(aes(freq, fct_reorder(word, freq))) +
  geom_col()

tweets_tokenized %>% 
  count(word, sort = TRUE, name = "freq") %>% 
  anti_join(stopwordslangs) %>%                    # twitter stop words.
  filter(!str_detect(word, "^\\@")) %>% 
  slice_head(n = 30) %>% 
  ggplot(aes(freq, fct_reorder(word, freq))) +
  geom_col()
```

Or plot the word frequency ratios of two tweeters
https://www.tidytextmining.com/twitter.html#word-frequencies-1

```{r word_freq plot}
ggplot(frequency, aes(RGONZALEZ1978, BrickFanatics)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  geom_abline(color = "firebrick")

ggplot(frequency, aes(BrickFanatics, fantasysite)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  geom_abline(color = "firebrick")
```


## Word Usage

https://www.tidytextmining.com/twitter.html#comparing-word-usage


```{r}
tweets_by_tweeter %>% 
  summarise(min_date = min(created_at), max_date = max(created_at))
```


```{r}
word_ratios <- tweets_tokenized %>%
  # filter(screen_name == "CBBCent1" | screen_name == "Adam_Bradford14") %>%
  filter(screen_name == "BrickFanatics" | screen_name == "fantasysite") %>% 
  filter(!str_detect(word, "^@")) %>%
  count(word, screen_name) %>%
  group_by(word) %>%
  filter(sum(n) >= 2) %>%
  ungroup() %>%
  pivot_wider(names_from = screen_name, values_from = n, values_fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(BrickFanatics / fantasysite)) %>%
  arrange(desc(logratio))

word_ratios
```

### equal usage

```{r}
word_ratios %>% 
  arrange(abs(logratio))
```


```{r word-usage}
word_ratios %>%
  group_by(logratio < 0) %>%  # < 0) %>%
  top_n(15, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col() + #show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (BrickFanatics/fantasysite)") +
  # scale_fill_discrete(name = "", labels = c("BrickFanatics", "fantasysite")) +
  scale_fill_brewer(palette = "Dark2", name = "", labels = c("BrickFanatics", "fantasysite"))
```

## Favorites and retweets

https://www.tidytextmining.com/twitter.html#favorites-and-retweets

## Changes in word use

https://www.tidytextmining.com/twitter.html#changes-in-word-use

## Term Document Matrix

https://www.tidytextmining.com/tfidf.html#the-bind_tf_idf-function

```{r}
tweet_words <- tweets_by_tweeter %>% 
  select(screen_name, text, status_id, user_id) %>%  
  unnest_tokens(word, text, token = "tweets") %>% 
  filter(!str_detect(word, "^\\@")) %>%
  filter(!str_detect(word, "^http")) %>%
  anti_join(stopwordslangs)  %>% 
  count(word, tweeter = screen_name, sort = TRUE)

tweet_words

total_words <- tweet_words %>%
  group_by(tweeter) %>%
  summarize(total = sum(n)) %>% 
  arrange(-total)

total_words

tweet_words <- left_join(tweet_words, total_words)

tweet_words
```

```{r}
tweet_words %>% 
  bind_tf_idf(word, tweeter, n)
```

```{r fig.height=12}
tweet_words %>% 
  bind_tf_idf(word, tweeter, n) %>% 
  arrange(desc(tf_idf)) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  filter(n > 10) %>% 
  # group_by(tweeter) %>% 
  # top_n(2) %>% 
  # ungroup() %>% 
  ggplot(aes(word, tf_idf)) +
  geom_col() +
  facet_wrap(~ tweeter) +
  coord_flip()
```


## Resource list

- http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

- http://antonio-ferraro.eu.pn/word-clouds-in-r-packages-wordcloud2-and-tm/

- https://jrnold.github.io/qss-tidy/discovery.html#textual-data

- https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html

- misc

    - http://www.cookbook-r.com/Manipulating_data/Converting_between_data_frames_and_contingency_tables/
    - https://www.r-bloggers.com/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/
    - https://www.quora.com/How-do-I-get-a-frequency-count-based-on-two-columns-variables-in-an-R-dataframe
    - https://www.quora.com/How-do-you-create-a-corpus-from-a-data-frame-in-R
    

